Abstract: In today's digital world, organizations and enterprises need to process queries on XML data efficiently. XML queries typically specify patterns of selection predicates on multiple elements that have specified tree-structured relationships. The primitive tree-structured relationships are parent-child and ancestor-descendant. Finding all occurrences of these relationships in an XML database is a core operation for XML query processing. In this paper the pattern matching algorithms TwigStack and TwigStackList are discussed. The behavior of TwigStack is analyzed, and a comparison of the two algorithms is attempted. TwigStack, the first holistic algorithm, scans the streams of XML nodes simultaneously to match their structural relationships holistically, reduces the number of unnecessary intermediate results, and skips XML nodes that will not contribute to final answers. The family of holistic pattern matching algorithms has become the most important class of algorithms for processing XML query patterns because of its efficiency and performance advantage. The experimental results show that query performance is significantly improved, especially for queries with relatively more complex structures and/or higher selectivities.

I. INTRODUCTION

XML employs a tree-structured model for representing data. In XML, XPath and XQuery [1] are used for addressing parts of an XML document and for specifying patterns of selection predicates on multiple elements that have specified tree-structured relationships. For example, the XQuery path expression

Book [author=suciu] // [title=XML]

matches Book elements that have an author suciu and a descendant with the title XML.

An XML tree pattern query, represented as a labeled tree, is essentially a complex selection predicate on both the structure and the content of an XML document. Tree pattern matching has been identified as a core operation in querying XML data. The data in XML is arranged according to a grammar, the DTD (Document Type Definition), shown in Fig 2. In web mining, data is retrieved from the web through the XML tree, which gives the relevant information to the users of the web. XML allows for structuring of data on the web. The structure of XML data is represented in Fig 1. An XML document is made of elements delimited by tags and is hierarchically structured.
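
The path expression above can be tried on a small document. The following is a minimal sketch in Python, assuming the standard library module xml.etree.ElementTree and its limited XPath subset; the sample document, its title texts and the helper name matching_books are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the author "Suciu" and the title "XML"
# come from the example in the text.
SAMPLE = """
<bib>
  <book>
    <author>Suciu</author>
    <title>Sample title</title>
    <chapter><title>XML</title></chapter>
  </book>
  <book>
    <author>Other</author>
    <title>Another title</title>
    <chapter><title>XML</title></chapter>
  </book>
</bib>
"""


def matching_books(root):
    """Return book elements with an author child 'Suciu' and at least
    one descendant that has a title child with the text 'XML'."""
    candidates = root.findall(".//book[author='Suciu']")
    return [b for b in candidates if b.find(".//*[title='XML']") is not None]


books = matching_books(ET.fromstring(SAMPLE))
print([b.findtext("author") for b in books])
```

The two-step filter evaluates the structural part (the descendant step) separately from the content predicate, which is the kind of decomposition that holistic algorithms are designed to avoid on large documents.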

II. BACKGROUND 

The extensible markup language XML has emerged as a standard for information representation and exchange on the Internet. XML allows users to define new tags for descriptive markup in their own applications. Since XML data is self-describing, XML is considered one of the most promising means of defining semi-structured data, which is expected to be ubiquitous in large volumes from diverse data sources and applications on the web. An XML tree contains Parent-Child (P-C) and Ancestor-Descendant (A-D) relationships, which are written as / and // respectively. A tree that contains both kinds of relationship is presented in Fig 3. Some earlier pattern matching algorithms [4] [5] are less efficient than TwigStack [2], which is discussed in this section. The TwigStack implementation is discussed in Section III. To overcome some limitations of TwigStack, TwigStackList is discussed in Section IV. Section V concludes the paper.

TwigStack [2] is a pattern matching algorithm that retrieves information much faster than many other algorithms [4] [5]. It processes the tree pattern holistically, without decomposing it into several small binary relationships. TwigStack [2] is optimal for tree pattern queries with only A-D edges: it guarantees that no useless intermediate result is produced for such queries.
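
The P-C and A-D relationships can be tested in constant time when every element carries a region code. The following Python sketch assumes a (start, end, level) code per element; the level field, the Region type and the sample values are assumptions made for illustration and are not taken from [2] or [3].

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Assumed region code of one element: start and end positions of the
    element in the document, plus its depth in the tree."""
    start: int
    end: int
    level: int


def is_ancestor(a: Region, d: Region) -> bool:
    """A-D relationship (//): the region of a strictly contains that of d."""
    return a.start < d.start and d.end < a.end


def is_parent(p: Region, c: Region) -> bool:
    """P-C relationship (/): containment plus a level difference of one."""
    return is_ancestor(p, c) and p.level + 1 == c.level


# Hypothetical codes for a book, one of its chapters, and the chapter title.
book = Region(1, 10, 1)
chapter = Region(6, 9, 2)
title = Region(7, 8, 3)

print(is_ancestor(book, title), is_parent(book, title), is_parent(chapter, title))
```

Containment alone decides A-D edges, whereas a P-C edge needs the extra level comparison; this difference is why queries with P-C edges are harder to process without useless intermediate results.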
[Figure: tree diagram with the nodes author (Suciu), title (XML), chapter, section and text]

Fig 1, An XML Tree Representation

Algorithm TwigStack operates in two phases. In the first phase (lines 1-11), some (but not all) solutions to individual query root-to-leaf paths are computed. In the second phase (line 12), these solutions are merge-joined to compute the answers to the query twig pattern, as shown in Fig 4.

<!ELEMENT bib (book*)>
<!ELEMENT book (author+, title, chapter*)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT chapter (title, section*)>
<!ELEMENT section (title, (text | section)*)>
<!ELEMENT text (#PCDATA | bold | keyword | emph)*>
<!ELEMENT bold (#PCDATA | bold | keyword | emph)*>
<!ELEMENT keyword (#PCDATA | bold | keyword | emph)*>
<!ELEMENT emph (#PCDATA | bold | keyword | emph)*>

Fig 2, A DTD for XML Data
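
A document that follows this DTD can be labelled with region codes before pattern matching. The sketch below is one possible way to do this in Python with xml.etree.ElementTree; the numbering scheme (one counter incremented at every start and end tag), the function name region_encode and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def region_encode(root):
    """Return (tag, start, end, level) for every element in document order.

    A single counter is incremented once when an element is entered and
    once when it is left, so the region of an ancestor always encloses
    the regions of its descendants.
    """
    counter = 0
    labels = []

    def visit(elem, level):
        nonlocal counter
        counter += 1
        start = counter
        slot = len(labels)
        labels.append(None)  # reserve the slot to keep document order
        for child in elem:
            visit(child, level + 1)
        counter += 1
        labels[slot] = (elem.tag, start, counter, level)

    visit(root, 1)
    return labels


# Hypothetical document conforming to the book part of the DTD.
doc = ET.fromstring(
    "<book><author>Suciu</author><title>Sample title</title>"
    "<chapter><title>XML</title></chapter></book>"
)
labels = region_encode(doc)
for label in labels:
    print(label)
```

Grouping the resulting labels by tag gives one stream per query node, sorted by start position, which is the input form that the stack-based algorithms read.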



Algorithm TwigStack(q)

// Phase 1
1  while ~end(q)
2    q_act = getNext(q)
3    if (~isRoot(q_act))
4      cleanStack(parent(q_act), nextL(q_act))
5    if (isRoot(q_act) or ~empty(S_parent(q_act)))
6      cleanStack(q_act, next(q_act))
7      moveStreamToStack(T_qact, S_qact, pointer to top(S_parent(q_act)))
8      if (isLeaf(q_act))
9        showSolutionsWithBlocking(S_qact, 1)
10       pop(S_qact)
11   else advance(T_qact)
// Phase 2
12 mergeAllPathSolutions()

Function getNext(q)
1  if (isLeaf(q)) return q
2  for q_i in children(q)
3    n_i = getNext(q_i)
4    if (n_i is not equal to q_i) return n_i
5  n_min = minarg_{n_i} nextL(T_ni)
6  n_max = maxarg_{n_i} nextL(T_ni)
7  while (nextR(T_q) < nextL(T_nmax))
8    advance(T_q)
9  if (nextL(T_q) < nextL(T_nmin)) return q
10 else return n_min

Procedure cleanStack(S, actL)
1  while (~empty(S) and (topR(S) < actL))
2    pop(S)

Fig 4, TwigStack Algorithm.
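
Function getNext of Fig 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of [2]: each query node holds its stream as a list of (left, right) region pairs sorted by left position, an exhausted stream reports an infinite position, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field

INF = float("inf")


@dataclass
class QueryNode:
    name: str
    stream: list                       # (left, right) pairs sorted by left
    children: list = field(default_factory=list)
    pos: int = 0                       # cursor into the stream

    def is_leaf(self):
        return not self.children

    def eof(self):
        return self.pos >= len(self.stream)

    def next_l(self):
        return INF if self.eof() else self.stream[self.pos][0]

    def next_r(self):
        return INF if self.eof() else self.stream[self.pos][1]

    def advance(self):
        self.pos += 1


def get_next(q):
    """Return the query node whose head element should be processed next."""
    if q.is_leaf():
        return q
    for child in q.children:
        n = get_next(child)
        if n is not child:             # line 4: a deeper node must go first
            return n
    n_min = min(q.children, key=QueryNode.next_l)
    n_max = max(q.children, key=QueryNode.next_l)
    while q.next_r() < n_max.next_l():  # lines 7-8: skip useless elements
        q.advance()
    if q.next_l() < n_min.next_l():
        return q
    return n_min


# Query a//b: the first a element ends before the only b element starts.
b = QueryNode("b", [(6, 7)])
a = QueryNode("a", [(1, 4), (5, 10)], children=[b])
chosen = get_next(a)
print(chosen.name, a.pos)
```

In the example the element (1, 4) is skipped without ever being pushed onto a stack, because its region ends before any b element begins.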



III. TWIGSTACK IMPLEMENTATION



TwigStack [2] (1) avoids generating large intermediate results that do not contribute to the final answer, (2) avoids unnecessary scanning of source documents, and (3) avoids unnecessary scanning of irrelevant portions of XML documents. An example query is

/library/category[@name=France]/book/title[@language=English]

The TwigStack [2] algorithm maintains a stack at each query node. The processing of a query is illustrated in Fig 5, and the arrangement of data in the stack at each node is presented in Fig 6.
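
The per-node stacks can be sketched as Python lists whose entries carry a pointer into the parent's stack. The code below is a simplified illustration for a single root-to-leaf path with A-D edges only; the entry layout and the function names clean_stack, push and path_solutions are assumptions.

```python
def clean_stack(stack, act_left):
    """Pop entries whose region ends before act_left; they cannot be
    ancestors of the element about to be processed."""
    while stack and stack[-1][0][1] < act_left:
        stack.pop()


def push(stack, element, parent_stack):
    """Push element together with the index of the current top of the
    parent's stack (-1 for the root, which has no parent stack)."""
    pointer = len(parent_stack) - 1 if parent_stack is not None else -1
    stack.append((element, pointer))


def path_solutions(stacks, depth, index):
    """Enumerate root-to-node paths ending at stacks[depth][index].
    Every entry at or below the stored pointer is an ancestor."""
    element, pointer = stacks[depth][index]
    if depth == 0:
        yield (element,)
        return
    for i in range(pointer + 1):
        for prefix in path_solutions(stacks, depth - 1, i):
            yield prefix + (element,)


# Query a//b over two nested a elements and one b element inside both.
a_stack, b_stack = [], []
push(a_stack, (1, 10), None)
push(a_stack, (2, 9), None)
push(b_stack, (3, 4), a_stack)
solutions = list(path_solutions([a_stack, b_stack], 1, 0))
print(solutions)
```

A single pointer on the b entry encodes both path solutions, which is how the stacks represent many partial answers compactly.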

[Figure: processing pipeline from the XPath query string to the query tree and the query tree with metadata; Phase 1 produces PathList (the intermediate result) and Phase 2 produces TreeList (the result).]

Fig 5, TwigStack implementation.
Fig 5 shows the two tasks of the TwigStack algorithm. The first task matches the query pattern against the XML data and generates partial solutions. The second task merges the partial solutions generated by the first task into the final solutions.
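
The second task can be illustrated with a small Python sketch. For brevity it uses a naive nested-loop join in place of the merge join of the algorithm; the dictionary representation of a path solution and the name merge_path_solutions are assumptions.

```python
from itertools import product


def merge_path_solutions(per_path_solutions):
    """Combine the solutions of the individual root-to-leaf paths.

    per_path_solutions holds one list per path; each solution maps a
    query node name to the matched element. Two solutions are compatible
    when they agree on every query node they share.
    """
    results = []
    for combo in product(*per_path_solutions):
        merged = {}
        if all(
            merged.setdefault(node, elem) == elem
            for solution in combo
            for node, elem in solution.items()
        ):
            results.append(merged)
    return results


# Twig a[b][c]: two solutions for path a-b, one solution for path a-c.
path_ab = [{"a": "a1", "b": "b1"}, {"a": "a2", "b": "b2"}]
path_ac = [{"a": "a1", "c": "c1"}]
answers = merge_path_solutions([path_ab, path_ac])
print(answers)
```

The path solution for a2 finds no partner and is discarded at this stage, which shows why partial solutions that cannot be merged count as useless intermediate results.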

[Figure: the query path library, category[@name], book, title[@language=English] matched against a document in which the same XML tag can be nested, with the query node stacks for library, category, book and title.]

Fig 6, Arrangement of data in the stack at each node.



The stack contents for a query are shown in the Query Node Stack of Fig 6. TwigStack [2] has several limitations: it maintains redundant data, it retrieves XML data more slowly than TwigStackList [3], it is not efficient for large queries on XML data, and it does not reduce the intermediate results for queries that contain P-C edges.

IV. TWIGSTACKLIST 

TwigStackList [3] combines TwigStack [2] with lists. It improves the efficiency of large queries on XML data and overcomes the redundancy of TwigStack [2]. The structure used by TwigStackList [3] is shown in Fig 7: a stack and a list are maintained at each query node.



[Figure: query nodes, each with an associated stack Sn and list Ln]

Fig 7, TwigStackList [3] where 'Sn' stands for Stacks and 'Ln' for Lists.
TwigStackList [3] operates in two phases. In the first phase (lines 1-16), it repeatedly calls the getNext algorithm with the query root as the parameter to get the next node for processing, and outputs solutions to individual query root-to-leaf paths. In the second phase (line 17), these solutions are merge-joined to compute the answer to the whole query. The getNext algorithm is presented in Fig 8 and the TwigStackList [3] algorithm in Fig 9.

[Figure: tree with the nodes Book, Title and XML]

Fig 3, XML tree with A-D and P-C relationships.

In lines 2-5 of Algorithm getNext, we recursively invoke getNext for each n_i in children(n). If any returned node g_i is not equal to n_i, we immediately return g_i (line 4). Lines 6 and 7 get the max and min elements among the current head elements in the lists or streams, respectively. Line 8 skips elements that do not contribute to results. If no common ancestor for all C_ni is found, line 9 returns the child node with the smallest start value, i.e. n_min. Line 10 is an important step: here we look ahead, reading some elements in the stream T_n and caching the elements that are ancestors of C_nmax in the list L_n. Whenever an element e_ni cannot find its parent in list L_n, for n_i in children(n), algorithm getNext returns node n_i (line 17). In TwigStack [2], getNext(n) returns n0 if the head element e_n0 in stream T_n0 has a descendant e_ni in each stream T_ni, for n_i in children(n0), and getNext(root) in TwigStackList [3] returns b1.

By using the TwigStackList algorithm, we can reduce the intermediate results of a query on XML data, and thereby reduce the redundancy found in TwigStack [2].

Algorithm 1 getNext(n)
1  if isLeaf(n) return n
2  for all nodes n_i in children(n) do
3    g_i = getNext(n_i)
4    if (g_i is not equal to n_i) return g_i
5  end for
6  n_max = maxarg_{n_i in children(n)} getStart(n_i)
7  n_min = minarg_{n_i in children(n)} getStart(n_i)
8  while (getEnd(n) < getStart(n_max)) proceed(n)
9  if (getStart(n) > getStart(n_min)) return n_min
10 MoveStreamToList(n, n_max)
11 for all nodes n_i in PCRchildren(n) do
12   if (there is an element e_i in list L_n such that e_i is the parent of getElement(n_i)) then
13     if (n_i is the only child of n) then
14       move the cursor p_n of list L_n to point to e_i
15     end if
16   else
17     return n_i
18   end if
19 end for
20 return n

Procedure getElement(n)
1  if ~empty(L_n) then
2    return L_n.elementAt(p_n)
3  else return C_n

Procedure getStart(n)
1  return the start attribute of getElement(n)

Procedure getEnd(n)
1  return the end attribute of getElement(n)

Procedure MoveStreamToList(n, g)
1  while C_n.start < getStart(g) do
2    if C_n.end > getEnd(g) then
3      L_n.append(C_n)
4    end if
5    advance(T_n)
6  end while

Procedure proceed(n)
1  if empty(L_n) then
2    advance(T_n)
3  else
4    L_n.delete(p_n)
5    p_n = 0 {move p_n to point to the beginning of L_n}
6  end if

Fig 8, getNext algorithm
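
The look-ahead procedures of Fig 8 can be sketched in Python as follows. The sketch assumes streams of (start, end) pairs sorted by start, adds an explicit end-of-stream check that the pseudocode leaves implicit, and uses hypothetical names (ListNode, cache, cursor) for the list L_n and its cursor p_n.

```python
class ListNode:
    """A query node with its stream T_n, list L_n and cursor p_n."""

    def __init__(self, stream):
        self.stream = stream   # (start, end) pairs sorted by start
        self.pos = 0           # position of the current element C_n
        self.cache = []        # L_n: cached ancestors found by look-ahead
        self.cursor = 0        # p_n: index into the cache

    def current(self):
        """C_n, or None when the stream is exhausted."""
        return self.stream[self.pos] if self.pos < len(self.stream) else None

    def get_element(self):
        """Procedure getElement: the list takes priority over the stream."""
        return self.cache[self.cursor] if self.cache else self.current()

    def proceed(self):
        """Procedure proceed: consume from the list first, then the stream."""
        if not self.cache:
            self.pos += 1
        else:
            del self.cache[self.cursor]
            self.cursor = 0


def move_stream_to_list(n, g):
    """Procedure MoveStreamToList: read ahead in the stream of n and cache
    every element that is an ancestor of the current element of g."""
    target = g.get_element()
    while n.current() is not None and n.current()[0] < target[0]:
        if n.current()[1] > target[1]:
            n.cache.append(n.current())
        n.pos += 1


n = ListNode([(1, 20), (2, 5), (6, 18), (30, 40)])
g = ListNode([(8, 9)])
move_stream_to_list(n, g)
print(n.cache, n.current())
```

Both (1, 20) and (6, 18) enclose the region (8, 9) and are cached, while (2, 5) is read and dropped; the parent test of lines 11-19 would then be applied to the cached elements.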
Algorithm 2 TwigStackList
1  while ~end() do
2    n_act = getNext(root)
3    if (~isRoot(n_act)) then
4      clearParentStack(n_act, getStart(n_act))
5    end if
6    if (isRoot(n_act) or ~empty(S_parent(n_act))) then
7      clearSelfStack(n_act, getEnd(n_act))
8      moveToStack(n_act, S_nact, pointer to top(S_parent(n_act)))
9      if (isLeaf(n_act)) then
10       showSolutionsWithBlocking(S_nact, 1)
11       pop(S_nact)
12     end if
13   else
14     proceed(n_act)
15   end if
16 end while
17 mergeAllPathSolutions()

Function end()
1  return (for all n_i in subtreeNodes(n): isLeaf(n_i) implies end(C_ni))

Function moveToStack(n, S_n, p)
1  push (getElement(n), p) to stack S_n
2  proceed(n)

Procedure clearParentStack(n, actStart)
1  while (~empty(S_parent(n)) and topEnd(S_parent(n)) < actStart) do
2    pop(S_parent(n))
3  end while

Procedure clearSelfStack(n, actEnd)
1  while (~empty(S_n) and topEnd(S_n) < actEnd) do
2    pop(S_n)
3  end while

Fig 9, TwigStackList Algorithm
Fig 10, Example Twig Query and Documents

TwigStack [2] pushes c1 onto stack S_c and outputs two "useless" intermediate path solutions <a1, b1> and <a1, c1, d1, f1>. This behavior of TwigStack [2] is reasonable because, based on the region coding of g1, one cannot decide whether g1 has a parent tagged with e. TwigStackList [3] does not hastily push c1 onto the stack, but first checks the parent-child relationship between e1 and g1. If e1 is not the parent of g1, then TwigStackList [3] caches e1 in a list and reads more elements in T_e. In this simple case, e1 is the only element in stream T_e.

V. CONCLUSION 

The construction of the XML tree and the importance of pattern matching algorithms for searching XML data have been discussed. The TwigStack algorithm is efficient in terms of time complexity, but its limitation is space complexity. How TwigStackList overcomes the limitations of TwigStack by reducing the intermediate results of a query on XML data has been elaborated upon. Our experiments have demonstrated that these pattern matching algorithms have an edge over other pattern matching algorithms.